Abstract: In order to track moving objects over long ranges despite occlusion, interruption, and background clutter, this paper proposes a unified approach for global trajectory analysis. Instead of traditional frame-by-frame tracking, our method recovers target trajectories from a short sequence of video frames, e.g. 15 frames. We first calculate a foreground map at each frame using a state-of-the-art background model. An attribute graph is then extracted from the foreground map, where the graph vertices are image primitives represented by composite features. With this graph representation, we pose trajectory analysis as a joint task of spatial graph partitioning and temporal graph matching. The task is formulated as maximum a posteriori (MAP) estimation under the Bayesian framework, in which we integrate the spatio-temporal contexts and the appearance models. The probabilistic inference is achieved by a data-driven Markov Chain Monte Carlo (MCMC) algorithm. Given a period of observed frames, the algorithm simulates an ergodic and aperiodic Markov Chain, and it visits a sequence of solution states in the joint space of spatial graph partitioning and temporal graph matching. In the experiments, our method is tested on several challenging videos from public visual surveillance datasets, and it outperforms the state-of-the-art methods.

Index Terms: Trajectory Analysis, Multiple Object Tracking, Graph Partitioning and Matching, Video Surveillance.

I. Introduction

Video object tracking is a fundamental problem in image/video processing and computer vision. It involves two key issues: (i) extracting objects of interest from backgrounds and (ii) establishing correspondences of objects over video frames. Trajectory parsing and analysis for multiple targets builds upon target tracking and plays a critical role in emerging intelligent applications, such as robotics and video surveillance systems [25], [34]. It also provides important support for higher-level video retrieval and event analysis [22], [43]. The objective of this work is to study a unified approach for trajectory analysis under the Bayesian framework. As Fig. 1 illustrates, the input of our algorithm is a short sequence of observed frames rather than a single frame, in which we localize the multiple moving targets and track them with their identities preserved; the global trajectories of targets for the whole video can be parsed through the inference.

A. Related Work 

In the literature, video object tracking has been intensively studied and many effective methods have been proposed. For single-target tracking, various object appearance models and motion models have been exploited to estimate the target state (location, velocity, etc.) [44], [24], [20], [34]. Recently, a class of techniques called “tracking by detection” has been shown to provide promising results [27], [31]. For multi-object tracking (i.e. trajectory analysis), which our method addresses, multiple moving targets must be identified by associating observations with objects as well as estimating the state of each target [32].

In general, we roughly categorize the work on trajectory analysis into two types, sequential inference based and deferred inference based, according to the number of input frames used for inference.

(I) Sequential inference based methods use the information of the currently observed frame to predict the states of moving targets and assign their target identities. The classical examples are particle filtering [34] and optical flow. Recently, Avidan proposed a learning-based tracker using the online Adaboost algorithm, which maintains a discriminative detector to track targets in the current frame. Babenko et al. significantly improved the tracking performance using Multiple Instance Learning (MIL). Despite their great success, these approaches may suffer from identity loss (or switching) and trajectory fragmentation under mutual interaction, occlusion, and spurious motion, because they make online decisions while discarding global information.

(II) Deferred inference based methods, also referred to as global data association based tracking, identify each observation with either a track ID or a false alarm over a short period of time, e.g. 15 frames. The observations, namely moving blobs, can be obtained by methods such as background subtraction. The first attempts at data association optimization are the Multiple Hypothesis Tracker [35], [9] and Joint Probabilistic Data Association Filters, which search the hypotheses (the associations of observations and targets) by assuming a one-to-one mapping, i.e. one observation to one target. Once this assumption is relaxed, e.g. a target consisting of a set of observations, the search space of the optimization grows exponentially with the number of frames and targets. To overcome this problem, many deterministic optimization algorithms have been employed, such as Extended Dynamic Programming, Quadratic Boolean Programming [19], and the Hierarchical Hungarian algorithm [14]. However, it is still impractical to apply these methods in intelligent surveillance systems, for the following reasons [36], [25]. First, some approaches to trajectory analysis need good initializations, e.g., manually annotating targets or assuming no conglutination in the first frame. Second, due to the ambiguity caused by the similar appearances of coupled targets, it is difficult to stably maintain the correct identities of targets in long-term tracking. In the example in Fig. 2(a), the track IDs of targets are switched in the crowd scene. Third, the affinity model of a moving target, i.e. the object representation, is not discriminative with respect to complex surrounding clutter, illumination and object scale changes, which often leads to false tracking or the splitting of one target into several pieces, as in the examples shown in Fig. 2(b) and (c).

Fig. 1. Illustration of the trajectory analysis. (a) shows a batch of successive video frames as the input of our method. (b) shows a few results of multiple target tracking, where the numbers around the tracking ellipses indicate the identities of targets. (c) visualizes the global trajectories of the video in a 3D perspective.



Fig. 2. A few typical challenges in trajectory analysis. (a) Due to the mutual interaction in the crowd scene, the track IDs of targets are switched. (b) The tracker is distracted by the background clutter. (c) The tracked target is split into several ones, due to illumination and object scale changes.


B. Method Overview 

According to the literature review, the proposed approach belongs to the deferred inference based methods. The goal of our approach is to parse trajectories of moving targets under the Bayesian framework, in which searching for the optimal trajectory solution is formulated as a problem of maximizing a posterior probability (MAP). We briefly introduce our method in the following three aspects: a composite feature for matching affinity of moving targets, a spatio-temporal graph for representing the task of trajectory analysis, and an iterative stochastic algorithm for global inference.

(I) In surveillance videos, particularly for outdoor scenes, it is critical to robustly recover correspondences over frames against illumination changes, drastic motion, etc. A consensus in recent image feature research is that a good image feature for tracking demands two properties: (i) discrimination, i.e. distinctive matching over frames, and (ii) robustness, i.e. geometric invariance and tolerance of non-rigid motion, etc. In fact, these two properties sometimes conflict with each other. For example, one may increase the region size (scale) of a local feature and/or the dimensionality of the descriptor to make it more distinctive, but a larger feature is usually less robust in tracking under photometric and geometric changes. In this paper, we propose a composite image feature to represent moving targets. We employ two types of well-known image features, SURF and MSER, in the composite features. Each composite feature is composed of a feature region generated by the MSER detector together with a set of SURF feature points inside it. This scheme is similar to the Bundled Feature proposed by Sun et al. [41] for web image search, but we define a different matching metric adapted to object tracking.

(II) Given the composite features extracted from the observed frames, we build a spatial graph and a temporal graph to pose the problem of trajectory analysis as a joint task of spatial graph partitioning and temporal graph matching. In the spatial graph, each graph vertex is a detected composite feature and each graph edge is defined by the appearance and motion consistency of the two adjacent vertices. In the temporal graph, each graph vertex represents one underlying target consisting of a connected cluster of composite features, and the graph edges denote the matching correspondences between targets in consecutive frames. With these graph representations, the task of graph partitioning corresponds to extracting and segmenting targets from the background; the graph matching task is equivalent to establishing the correspondences of targets over frames. We further formulate these tasks as maximizing the posterior probability under the Bayesian framework. In addition, two types of scene contexts are integrated as informative priors: (i) target size prediction using scene geometric information, inspired by previous work [32], and (ii) a target motion prior model based on path statistics. These types of prior knowledge make the model more robust and efficient. For example, when two people with similar appearances walk close together, our model tends to segment them into two individual targets according to the prior term of target size prediction.

(III) Searching for the maximum of the posterior probability under our formulation is a non-trivial optimization procedure. There are many ambiguities caused by conglutinations, occlusions, and similar appearances of targets and background clutter in crowded surveillance scenes. The searching order or rule for an optimal solution is thus quite difficult to design. From the perspective of energy minimization, there exist quite a few local minima, e.g., track ID switching, in the search for the energy minimum. Therefore, unlike the deterministic or heuristic searching in previous work on trajectory inference [44], we design a stochastic sampling algorithm using the Markov Chain Monte Carlo (MCMC) mechanism [30] to explore the solution space. In the literature, some work has shown strong results on solving spatio-temporal data association with an MCMC-based algorithm. In our method, we adopt an MCMC-based cluster sampling method, namely Swendsen-Wang Cut, for optimal solution exploration. The algorithm iterates between two types of MCMC dynamics for the spatial graph partitioning and the temporal graph matching, respectively.

Compared with some recently proposed approaches [26] which also adopt stochastic inference for trajectory analysis, the major advantages of the proposed method are as follows. (1) We adopt two types of MCMC dynamics to iteratively solve video object segmentation and tracking, which are mutually conditional and closely coupled. This algorithm is able to explore the solution space for the global optimum and eliminates the need for good initializations. (2) The proposed composite feature provides a flexible and robust representation against scene clutter and object geometric deformations in tracking. (3) We apply our method to various challenging surveillance videos from several public datasets and show that it outperforms other approaches.

This paper is organized as follows. We first introduce the problem representation and formulation in Section II and Section III. Then Section IV presents the algorithm for trajectory inference, and Section V describes the implementation details and the system flow. A set of experiments and comparisons are presented in Section VI, and the paper is concluded with discussions in Section VII.

II. Problem Representation 

Given an input video, we set an observed window spanning $\tau$ frames for each computation of trajectory analysis. The observed window moves with a step size of $\eta$ frames. Using a state-of-the-art background modeling algorithm [38], the image lattice $\Lambda_t$, $t = 1, \dots, \tau$, of each frame is initially partitioned into foreground and background domains, $\Lambda_t = \Lambda_t^F \cup \Lambda_t^B$. The trajectory analysis takes the foreground domain as the input, although the background subtraction is not perfect, i.e. false alarm regions occur. We then propose a novel image feature, namely the composite feature, extracted from the foreground domain, based on which a spatial graph and a temporal graph are constructed. Each vertex in the spatial graph is a composite feature and each vertex in the temporal graph represents a segmented moving target. In the following, we start by introducing the composite features, then define the problem of trajectory analysis via graph representation, and present the probabilistic formulation.

Fig. 3. The composite feature bundling SURF points and MSER regions. The moving target tracked by a black bounding box in (a) can be represented by the composite features in (b), where the blue ellipses indicate the MSER regions and the red crosses indicate the SURF points. Note that we discard the MSER regions that overlap heavily or contain no SURF points.

A. Composite Features 

For representing moving objects, we propose a composite image feature that bundles a region with several key points to improve both discrimination and robustness. The proposed composite feature involves two popular features: the point feature SURF and the region feature MSER. The SURF keypoint detector finds scale-space extrema using the determinant of the Hessian matrix and employs integral images for rapid computation. The MSER feature is defined by an extremal property of the intensity function inside the region and on its outer boundary. Both features are robust against changes of viewing angle, scale, and illumination. Some extracted SURF points and MSER regions are shown in Fig. 3(b).

Given a foreground image domain $\Lambda_t^F$, we first detect the point and region features, denoted by $S = \{s_i\}$ and $R = \{r_j\}$ respectively. We allow overlaps among the region features, and discard those of large size, i.e. those containing others or spanning half the size of the foreground domain. A composite feature $Z_j$ is then defined as

$Z_j = \{ r_j, S_j = \{ s_i : s_i \propto r_j, s_i \in S \} \}, \quad r_j \in R, \ S_j \subset S$, (1)

where $s_i \propto r_j$ indicates that the point feature $s_i$ lies inside the region feature $r_j$. Composite features that include no SURF points are removed automatically. In practice, the number of SURF points in each composite feature is between 5 and 10. A moving target represented by the composite features is illustrated in Fig. 3.

The measuring energy $E(Z_a, Z_b)$ of two composite features $Z_a$ and $Z_b$ includes two terms, independence similarity and configuration consistency:

$E(Z_a, Z_b) = E_I + \lambda_c E_C$, (2)

where $\lambda_c$ is a weighting parameter that balances the two terms.

Fig. 4. An example of measuring the configuration consistency of two composite features. We denote the MSER region by the ellipse, the SURF points by the red crosses, and the centroid of the feature by the black spot. For the left composite feature, the relative order for the configuration is {1,2,3,4,5}, and for the right one, the relative order is {1,5,3,4,2}. Thus, the configuration consistency of these two composite features is (1+0+1+1+0)/5 = 0.6.

(I) The independence similarity $E_I$ is based on the matching distance of the two region features. The energy of this term is defined as

$E_I(Z_a, Z_b) = \| h(r_a) - h(r_b) \|$, (3)

where $h(\cdot)$ is the descriptor for the SURF feature.

(II) The configuration consistency $E_C$ performs a weak geometric verification between two composite features. Let $\{ s_i \leftrightarrow s_j : s_i \in S_a, s_j \in S_b \}$ denote the set of matched feature pairs of two composite features $Z_a$ and $Z_b$. This set can be quickly calculated by matching SURF points in a greedy manner: searching for the best match of each point within the region of the corresponding composite feature. We define their configuration consistency based on the relative order of the matched points. Given the centroid of a region feature, the relative order of its inside points is determined by their spatial distance to the centroid. As Fig. 4 illustrates, we number the points on the left based on their spatial distance to the centroid, i.e. 1,2,3,4,5; the numbers of the points on the right are propagated from the left points based on the matching correspondence. The consistency is then computed as

$E_C(Z_a, Z_b) = \frac{\sum_{s_i \leftrightarrow s_j} \mathbf{1}(O(s_i) = O(s_j))}{|\{ s_i \leftrightarrow s_j \}|}$, (4)

where $O(\cdot)$ denotes the relative order of a point, $\mathbf{1}(\cdot)$ is the indicator function, and $|\{ s_i \leftrightarrow s_j \}|$ is the number of matched point pairs. The unmatched points are not taken into account in the definition because the appearance dissimilarity has already been penalized by the first term $E_I$ in Eqn. (2). Specifically, the cost of $E_I$ would be relatively large with respect to $E_C$ if the numbers of points are discrepant (e.g., 5 vs. 10). Moreover, to make this consistency penalty smooth and gentle, we can additionally apply the sigmoid function to the relative order computation.

We observe that, unlike a single type of feature, a composite feature provides a flexible and stable representation that captures the distinctive image primitives as well as the geometric structure.
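
To make the relative-order computation concrete, the following is a minimal Python sketch of the configuration consistency, assuming the matched point pairs are already available as index pairs. The function names `relative_order` and `configuration_consistency` are hypothetical and are not taken from the original method description; the optional sigmoid smoothing is omitted.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def relative_order(points: Sequence[Point], centroid: Point) -> List[int]:
    """Return the rank (1 = closest) of each point by distance to the centroid."""
    dists = [math.hypot(x - centroid[0], y - centroid[1]) for x, y in points]
    ranking = sorted(range(len(points)), key=lambda k: dists[k])
    order = [0] * len(points)
    for rank, idx in enumerate(ranking, start=1):
        order[idx] = rank
    return order


def configuration_consistency(order_a: Sequence[int],
                              order_b: Sequence[int],
                              matches: Sequence[Tuple[int, int]]) -> float:
    """Fraction of matched pairs (i, j) whose relative orders agree.

    order_a[i] and order_b[j] are the ranks of the matched points inside
    their own composite features. Unmatched points are ignored.
    Returning 0.0 for an empty match set is an assumption of this sketch.
    """
    if not matches:
        return 0.0
    agree = sum(1 for i, j in matches if order_a[i] == order_b[j])
    return agree / len(matches)
```

For two features whose five matched points have relative orders {1,2,3,4,5} and {1,5,3,4,2}, three of the five pairs agree, so the function returns 3/5 = 0.6.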



Fig. 5. The graph representations for trajectory analysis. (a) shows the input video sequence. (b) shows the foreground mask for frame $I_t$. (c) illustrates the spatial attribute graph of the currently observed frame $I_t$, where each graph vertex denotes a composite feature of the foreground domain and has four bonds connecting to neighboring vertices. The graph edges represent the motion and appearance consistency between two adjacent vertices. The edges between the foreground and background domains are turned off automatically. (d) illustrates the temporal attribute graph with the vertices being the connected clusters of spatial graph vertices. Each temporal graph vertex indicates an underlying target. The edges in the temporal graph represent the matching correspondences over frames. Note that the vertices in the bottom row in (d) indicate the unmatched regions and have no temporal connections.


B. Trajectory Analysis via Graph Representation 

Given the observed window, i.e. a period of frames $I_{[0,\tau]}$, we extract the composite features $\{Z_{t,i}; t = 0, \dots, \tau\}$ on the foreground areas. Then we obtain a set of spatial graphs in which each composite feature is treated as a graph vertex, $v_{t,i} = Z_{t,i}$, $t = 0, \dots, \tau$.

The goal of trajectory analysis is to segment moving targets in each frame and recover their correspondences over frames. With the graph representation, this problem is posed as a joint task of graph partitioning and matching.

I. Spatial graph partitioning segments the targets over a time span $\tau$. As illustrated in Fig. 5(b), we represent the partition of the observed frames as $\Pi_{[0,\tau]}$,

$\Pi_{[0,\tau]} = \{ \Pi_t; t = 0, 1, 2, \dots, \tau \}$,
$\Pi_t = \{ U_{t,i}; i = 0, 1, 2, \dots, K_t \}$, (5)

where $K_t$ is the number of targets at time $t$, and $U_{t,0}$ indicates the false alarm regions, i.e. regions that are not targets but are proposed as foreground. Each moving target $U_{t,i}$ at time $t$ is described by a bounding box,

$U_{t,i} = (x_{t,i}, y_{t,i}, w_{t,i}, h_{t,i}), \quad i = 1, 2, \dots, K_t$, (6)

where $(x_{t,i}, y_{t,i})$ denotes the target center and $(w_{t,i}, h_{t,i})$ denotes the width and height. The initial foreground domain $\Lambda_t^F$ consists of the target image domains $\Lambda_{t,i}^F$ and the false alarm domain $\Lambda_{t,0}^F$,

$\Lambda_t^F = \bigcup_{i=1}^{K_t} \Lambda_{t,i}^F \cup \Lambda_{t,0}^F$. (7)

We solve the foreground partitioning $\Pi_t$ with a spatial graph representation (as shown in Fig. 5(c)), defined over the foreground image lattice with nearest 4-neighbor connections, $G_t^S = (V_t^S, E_t^S)$, where $V_t^S$ is the set of graph vertices and $E_t^S$ is the set of edges connecting neighboring graph vertices. Each spatial graph vertex $v_{t,i} \in V_t^S$ includes one composite feature $Z_{t,i}$ and a corresponding label indicating which target or false alarm the vertex belongs to. Therefore, each target $U_{t,i}$ at time $t$ corresponds to a set of connected graph vertices in $V_t^S$. We solve the task of graph partitioning by turning off edges, i.e., generating disjoint subgraphs, which will be introduced in Section IV-A.

II. Temporal graph matching recovers the correspondences of targets over the time span $\tau$. We represent a set of matching matrices by $\Phi_{[0,\tau]}$,

$\Phi_{[0,\tau]} = \{ \Phi_t; t = 0, 1, \dots, \tau - 1 \}$, (8)

where each matrix $\Phi_t$ describes a mapping relation from the $t$-th frame to the $(t+1)$-th frame. A target matched to 0 is occluded or moving out of the scene at the current frame (i.e. being “killed”), while a target with no matches in previous frames is newly appearing (i.e. being “born”).

As illustrated in Fig. 5(d), a temporal graph $G_t^T = (V_t^T, E_t^T)$ is defined for moving targets. Each temporal graph vertex $v_{t,i} \in V_t^T$ includes a moving target $U_{t,i}$ and its matching label at time $t$. Each edge indicates the matching relation of two vertices between adjacent frames, as $e_{t,i} = \{ \langle v_a, v_b \rangle : v_a, v_b \in V^T, \langle v_a, v_b \rangle \in E^T \}$. Since we have performed partitioning on the spatial graph, we can reasonably assume a one-to-one mapping between temporal nodes. Note that unmatched nodes, caused by false alarm regions from the background subtraction, are allowed to stand alone. In Fig. 5(d), the blobs with different colors represent the temporal graph nodes and the dotted ones indicate the unmatched regions.

Therefore, for the problem of trajectory completion, we define the following solution representation $W$ from the observed frames $I_{[0,\tau]}$,

$W_{[0,\tau]} = \{ N_{[0,\tau]}, \Pi_{[0,\tau]}, \Phi_{[0,\tau]} \}$, (9)

where $N_{[0,\tau]}$ denotes the number of foreground targets in the time span $\tau$, $\Pi_{[0,\tau]}$ denotes the partition result for each frame, and $\Phi_{[0,\tau]}$ denotes the matching correspondences of moving targets between adjacent frames, in the form of matrices mapping one target to another.

Equivalently, the solution configuration of trajectory completion can also be represented by $N$ motion trajectories, also called “cables”,

$W_{[0,\tau]} = \{ C_i; i = 0, 1, \dots, N \}$, (10)

where $C_0$ represents the false alarm regions, and each other cable represents the trajectory of a foreground moving target. This representation makes it simple to define the motion models.

$C_i = \{ t_{i,b}, t_{i,d}, \{ U_{t,i}; t \in [t_{i,b}, t_{i,d}] \} \}; \quad i = 1, \dots, N$, (11)

$C_0 = \{ U_{t,0}; t = 0, 1, \dots, \tau \}$, (12)

where $t_{i,b}$ and $t_{i,d}$ denote the birth time and death time of the trajectory $C_i$, respectively.
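
The equivalence between the per-frame matching relations and the cable representation can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the matching matrices are simplified to dictionaries, targets in each frame are indexed from 1, a match to 0 (or a missing entry) means the target is killed, and the false alarm cable is left out. The function name `build_cables` and its data layout are hypothetical.

```python
from typing import Dict, List, Tuple

# A cable is stored as (birth time, death time, [target index in each frame]).
Cable = Tuple[int, int, List[int]]


def build_cables(num_targets: List[int],
                 matches: List[Dict[int, int]]) -> List[Cable]:
    """Link one-to-one frame-to-frame matches into trajectories.

    num_targets[t] is the number of targets in frame t (indices 1..K_t).
    matches[t][i] = j maps target i in frame t to target j in frame t + 1.
    """
    cables: List[Cable] = []
    continued = set()  # (frame, index) pairs reached from an earlier frame
    for t, k_t in enumerate(num_targets):
        for i in range(1, k_t + 1):
            if (t, i) in continued:
                continue  # already part of an existing cable, not a birth
            path, cur_t, cur_i = [i], t, i
            # Follow the matches until the target is killed or the window ends.
            while cur_t < len(matches) and matches[cur_t].get(cur_i, 0) != 0:
                cur_i = matches[cur_t][cur_i]
                cur_t += 1
                continued.add((cur_t, cur_i))
                path.append(cur_i)
            cables.append((t, cur_t, path))
    return cables
```

Each returned tuple lists the birth frame, the death frame, and the per-frame target indices of one trajectory, which mirrors the structure of Eqn. (11).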

Therefore, in the probabilistic formulation in Section III, we are able to switch between the two notations above.


III. Probabilistic Formulation 

Based on the definition of the solution $W$, we formulate the inference problem in a Bayesian framework, and the optimal solution $W^*$ is obtained by maximizing a posterior probability,

$W^*_{[0,\tau]} = \arg\max p(W_{[0,\tau]} \mid I_{[0,\tau]})$ (13)
$= \arg\max p(I_{[0,\tau]} \mid W_{[0,\tau]}; \beta) \, p(W_{[0,\tau]} \mid \theta)$,

where $\beta$ and $\theta$ are the parameters of the likelihood and prior models, respectively.


A. Prior Model

We define the prior model $p(W_{[0,\tau]} \mid \theta)$ on scene contexts, which provide informative guidance for graph partitioning and matching, as

$p(W_{[0,\tau]} \mid \theta) \propto p(\Pi_{[0,\tau]}) \cdot p(\Phi_{[0,\tau]})$. (14)

Note that the probability terms are assumed to be independent, since they can be calculated separately.

I. Partition prior $p(\Pi_{[0,\tau]})$. We assume each frame is segmented separately and define the prior as

$p(\Pi_{[0,\tau]}) = \prod_{t=0}^{\tau} p(\Pi_t) = \prod_{t=0}^{\tau} \prod_{i=0}^{K_t} p(U_{t,i})$. (15)

Instead of using the Potts model as a partition prior as in previous work, we predict the target location and size according to the scene surface property and camera calibration information.

According to research on geometric context [32], the object size in the image plane is correlated with its physical size (in the real world) through the scene geometric information, i.e. the camera parameters and the ground plane. The scene geometry can be roughly estimated in an interactive manner in a surveillance system, following a recent work [25]. We can then employ an informative prior on the target size in the image plane, if the tracked targets belong to a specific object category. In other words, the prior distribution of target size is conditional on the target location in the image. In this work, considering the requirement of real-time processing, it is not practical to integrate target recognition into the trajectory analysis, and we thus assume that the semantic label of targets is specified for a given scene. In fact, this assumption is reasonable; e.g., indoor surveillance systems usually aim at people while outdoor systems usually track vehicles.

Fig. 6(a) illustrates the location-size prediction with scene geometry. Let $B$ and $C$ denote the top and the bottom of the car, $A$ the intersection of the car's vertical line with the horizon line in the image plane, and $D$ the vertical vanishing point. Besides, let $h_p$ denote the car height and $h_c$ the camera height. The expected size of an observed vehicle on the ground plane can be predicted by simply following the cross ratio theorem,

$\frac{BC}{BA} : \frac{DC}{DA} = \frac{h_p}{h_c - h_p}$. (16)

Therefore we can obtain the target size distribution with respect to the target location, $f_c(h, w \mid x, y)$. Suppose the location of target $U_{t,i}$ is $(x_{t,i}, y_{t,i})$; the partition prior can thus be written as

$p(U_{t,i}) \propto f_c(h, w \mid x = x_{t,i}, y = y_{t,i})$. (17)

An example of predicting sizes of vehicles in the surveillance scene is presented in Fig. 6(b), where we sample vehicle sizes from $f_c(h, w \mid x, y)$.
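
As a worked illustration of the cross ratio relation, the sketch below predicts where the top of a target should appear in the image, given its bottom point. It reads Eqn. (16) as (BC/BA) : (DC/DA) = h_p / (h_c - h_p) with unsigned distances, and it assumes that all four points are given as scalar coordinates along the same vertical image line, that the target is lower than the camera, and that the vanishing point lies outside the segment between the bottom point and the horizon. The function names are hypothetical.

```python
def predict_top(c: float, a: float, d: float, h_p: float, h_c: float) -> float:
    """Coordinate of the target top B along the line through C, A and D.

    c: bottom of the target, a: intersection with the horizon line,
    d: vertical vanishing point, h_p: target height, h_c: camera height.
    """
    if not 0.0 < h_p < h_c:
        raise ValueError("this sketch assumes 0 < h_p < h_c")
    k = h_p / (h_c - h_p)
    # Solving (b - c)(d - a) = -k (b - a)(d - c) for b, which is the signed
    # form of the cross ratio when B lies between C and A.
    return (c * (d - a) + k * a * (d - c)) / ((d - a) + k * (d - c))


def cross_ratio(b: float, c: float, a: float, d: float) -> float:
    """Unsigned cross ratio (BC / BA) : (DC / DA)."""
    return (abs(b - c) / abs(b - a)) / (abs(d - c) / abs(d - a))
```

The expected image height of the target is then `abs(predict_top(...) - c)`. When the vanishing point is very far away, the prediction approaches the affine case in which the top lies at the fraction h_p / h_c of the way from the bottom point to the horizon.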



Fig. 6. Location-size constraint. (a) The target size in the surveillance image can be directly estimated according to the homography between the image plane and the ground plane. (b) We show an example of predicting vehicle sizes in the image as the prior information.

Fig. 7. A statistical path model for defining the matching prior on trajectory. (a) shows the statistical birth and death positions of moving targets in the scene. (b) shows the reference trajectories in the scene.


II. Matching prior on trajectory $p(\Phi_{[0,\tau]})$. For simplicity, we use the cable representation to define this prior model, which includes two terms: (i) the birth, death, and length (lifespan) of the cable, and (ii) the trajectory shape of the cable. Thus, the matching prior is factorized into the following probability terms,

$p(\Phi_{[0,\tau]}) = \prod_{i=1}^{N} p(C_i)$, (18)

$p(C_i) = p(t_{i,b}, t_{i,d}) \cdot p(\Gamma_i, \mathcal{R})$, (19)

where $C_i$ represents the $i$-th target trajectory. The first term $p(t_{i,b}, t_{i,d})$ gives the prior distribution of birth/death on the global trajectory, as shown in Fig. 7(a). $\Gamma_i$ denotes the trajectory shape, i.e. the curve of the trajectory. The second term $p(\Gamma_i, \mathcal{R})$ is a global motion prior based on a path model $\mathcal{R}$, which consists of a set of reference trajectories $\{\Gamma\}$, as shown in Fig. 7(b). We can learn these reference trajectories by clustering in a supervised way according to the method reported by Wang et al. [40]. The motion prior then takes the form of a mixture model plus a robust statistic,

$p(\Gamma_i, \mathcal{R}) \propto \exp\{ - \min_{\Gamma_j \in \mathcal{R}} \Delta(\Gamma_i, \Gamma_j) + \epsilon \}$, (20)

where the function $\Delta(\cdot)$ denotes the geometrical distance between the shapes of two trajectories, and $\epsilon$ is a tuning parameter for robustness.

B. Likelihood Model

The likelihood model $p(I_{[0,\tau]} \mid W_{[0,\tau]}; \beta)$ includes the following two aspects: (i) how well the region appearances fit the background model, and (ii) the appearance consistency of the trajectories,

$p(I_{[0,\tau]} \mid W_{[0,\tau]}; \beta) = \prod_{t=0}^{\tau} p(\Lambda_t^F \mid \Pi_t, B) \cdot \prod_{i=1}^{N} p(\Lambda(C_i) \mid C_i)$, (21)

where $\Lambda_t^F$ denotes the image domain of the foreground and $B$ the background model proposed by [38]. $\Lambda(C_i)$ indicates the image domain covered by trajectory $C_i$, i.e. the moving target $U_i$ over $t$ frames. The appearance consistency of the trajectories $p(\Lambda(C_i) \mid C_i)$ is equivalent to the matching similarity between targets over frames,

$p(\Lambda(C_i) \mid C_i) = \prod_{t = t_{i,b}}^{t_{i,d} - 1} p(\Lambda_{t+1,i}^F \mid \Lambda_{t,i}^F)$, (22)

where $\Lambda_{t+1,i}^F$ and $\Lambda_{t,i}^F$ denote the image domains of adjacent targets. The target matching can be further calculated by measuring the composite features of the targets,

$p(\Lambda_{t+1,i}^F \mid \Lambda_{t,i}^F) \propto \exp\{ - \frac{\sum E(Z_i, Z_j)}{|U_{t,i}|} \}$, (23)

where $E(Z_i, Z_j)$ is the distance metric between two composite features, as defined in Eqn. (2), and $|U_{t,i}|$ denotes the total number of extracted features in the target.


IV. Inference Algorithm 

Given the spatial and temporal graph representations, the problem of trajectory recovery is posed as two coupled tasks: spatial graph partitioning $\Pi_{[0,\tau]}$ and temporal graph matching $\Phi_{[0,\tau]}$. In this section, we discuss a stochastic sampling algorithm that solves the two tasks jointly.

The reasons for using a stochastic scheme rather than deterministic optimization methods, e.g. Belief Propagation or Graph-cuts, are as follows. (1) It is difficult to design fast searching rules due to the unpredictable variance and ambiguity of tracked targets. (2) The probabilistic formulation is non-convex. (3) We usually cannot obtain a reliable initialization for trajectory analysis.

The proposed stochastic inference algorithm, designed under the Metropolis-Hastings mechanism [30], is able to efficiently seek the optimal solution $W^*_{[0,\tau]}$ from the posterior probability $p(W_{[0,\tau]} \mid I_{[0,\tau]})$ defined in Eqn. (13),

$W^*_{[0,\tau]} \sim p(W_{[0,\tau]} \mid I_{[0,\tau]})$. (24)


We simulate an ergodic and aperiodic Markov Chain in which the algorithm visits a sequence of states in the joint space of $\{\Pi_{[0,\tau]}, \Phi_{[0,\tau]}\}$ over the time span $\tau$. Specifically, the sampling process iterates between two types of Markov Chain Monte Carlo (MCMC) dynamics that infer the graph partitioning $\Pi_{[0,\tau]}$ and the graph matching $\Phi_{[0,\tau]}$, respectively. The two components work in an iterative manner as follows:

• Fixing the current state of graph matching $\Phi_{[0,\tau]}$, we perform cluster sampling to explore new solutions of the graph partition $\Pi_{[0,\tau]}$.

• Fixing the current state of graph partition $\Pi_{[0,\tau]}$, we update the graph matching state $\Phi_{[0,\tau]}$ by changing the matching relations of objects in the trajectories.

In both components, each sampling step is achieved by realizing a reversible jump (i.e. operator) between two successive states to explore new solutions, for either graph partitioning or graph matching. The acceptance of a new state is decided by a Metropolis-Hastings [30] decision to guarantee the convergence of the inference algorithm. In general, given two successive states $A$ and $B$ for either partitioning or matching, the acceptance rate is defined as

$\alpha(A \to B) = \min\left( 1, \frac{Q(B \to A) \, p(B)}{Q(A \to B) \, p(A)} \right)$, (25)

where $p(A)$ and $p(B)$ are the posterior probabilities of $W_{[0,\tau]}$ defined in Eqn. (13), evaluated at states $A$ and $B$. $Q(B \to A)$ is the proposal probability driving the state transition from $B$ to $A$ and, conversely, $Q(A \to B)$ is the proposal probability from state $A$ to $B$.
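
The acceptance decision can be sketched in a few lines of Python. This is a generic Metropolis-Hastings step rather than the paper's implementation: the posterior and proposal values are assumed to be supplied by the caller as log-probabilities (a choice made here for numerical stability), and the function names are hypothetical.

```python
import math
import random


def acceptance_rate(log_p_a: float, log_p_b: float,
                    log_q_ab: float, log_q_ba: float) -> float:
    """alpha(A -> B) = min(1, Q(B->A) p(B) / (Q(A->B) p(A))), in log space."""
    log_ratio = (log_q_ba + log_p_b) - (log_q_ab + log_p_a)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def accept(alpha: float, rng: random.Random) -> bool:
    """Accept the proposed state B with probability alpha."""
    return rng.random() < alpha
```

A move to a state with higher posterior under a symmetric proposal is always accepted, while a move to a state with half the posterior is accepted about half of the time.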

Designing the proposal probability that drives the solution state transition is a non-trivial task that has been addressed by a line of work in the literature [45], [21]. Recently, an MCMC-based cluster sampling algorithm, namely “Swendsen-Wang Cut” (SWC), was proposed for image segmentation; it simplifies the calculation of the ratio of proposal probabilities in graphical models. We refer the reader to the original SWC work for the theoretical background.

In the following, we will discuss, respectively, the cluster 
sampling algorithm for graph partitioning and graph matching. 



Fig. 8. Three typical solution states in the spatial graph. At each stage of
sampling for spatial graph partitioning, a connected cluster, CC, is generated
by turning edges off and is then re-labeled to produce a new solution state.

Fig. 9. Illustration of the inference in the temporal graph. (a) The connected
cluster is generated by probabilistically turning off edge connections. (b)-(e)
show the solution state transitions produced by different reversible jumps.


A. Sampling for Spatial Graph Partitioning 

Given a spatial graph $G^S_t$ extracted from the observed frame
$I_t$, we use SWC sampling for graph partition inference. The
algorithm realizes a reversible jump between two states in the
solution space in the following two steps.

Step 1. We generate a connected cluster by probabilistically 
turning off the edge links in the graph. 

In the spatial graph $G^S_t = (V^S_t, E^S_t)$, $V^S_t$ is the
set of graph vertices specifying the composite features and
$E^S_t$ is the set of edges connecting neighboring graph
vertices, as shown in Fig. (c). For notational simplicity, we
omit the time stamp $t$ and the superscript $S$ in the algorithm
description. For any edge $e \in E$, we introduce an auxiliary
random variable $\mu_e \in \{\text{on}, \text{off}\}$, i.e., the
connecting variable, which indicates whether the edge is turned
on or off. The edge turn-on probability $q_e$ is defined
according to the similarity of the two connected vertices,

$$q_e = p(\mu_e = \text{on} \mid v_a, v_b), \quad (26)$$

where $v_a$ and $v_b$ are the two graph vertices connected by
the edge $e$. We collect discriminative appearance and motion
features (color, orientation gradient, and optical flow) into a
compact histogram $F$, in which each bin corresponds to a
specific feature dimension. Over the image domain of a vertex,
we describe colors in the Luv color space and pool them into 32
bins; the orientation gradients are quantized into 48 bins and
the optical flows into 9 bins. For an edge
$e = \langle v_a, v_b \rangle$, the turn-on probability $q_e$ of
two adjacent vertices can thus be defined by their appearance
and motion consistency as

$$q_e = q(\mu_e = \text{on} \mid F(v_a), F(v_b)) \propto \exp\left\{ -\frac{K(F(v_a) \,\|\, F(v_b)) + K(F(v_b) \,\|\, F(v_a))}{T_e} \right\}, \quad (27)$$

where $K(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence
between two histograms and $T_e$ is a constant temperature
factor. Hence each edge is turned off with probability
$1 - q_e$ (as shown in Fig. 8). It is worth mentioning that the
turn-on probabilities of edges are calculated during graph
extraction, before the sampling iterations.
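
To make Eqn. 27 concrete, the following minimal sketch computes the turn-on probability of one edge from two feature histograms (for instance, the 32 + 48 + 9 = 89 bins described above). Several details are assumptions for illustration only: the small smoothing constant `eps` that avoids taking the logarithm of zero, the default temperature of 1.0, and taking the proportionality constant in Eqn. 27 to be one.

```python
import math


def kl_divergence(p, q, eps=1e-10):
    """K(p || q) between two histograms, normalized after eps-smoothing.

    The smoothing term eps is an assumption of this sketch; it keeps
    empty bins from producing log(0).
    """
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    z_p, z_q = sum(p_s), sum(q_s)
    return sum((a / z_p) * math.log((a / z_p) / (b / z_q))
               for a, b in zip(p_s, q_s))


def turn_on_probability(f_a, f_b, temperature=1.0):
    """q_e = exp(-(K(Fa||Fb) + K(Fb||Fa)) / T_e), proportionality constant 1."""
    symmetric_kl = kl_divergence(f_a, f_b) + kl_divergence(f_b, f_a)
    return math.exp(-symmetric_kl / temperature)
```

Identical histograms give a turn-on probability of one, and the probability decays as the two vertices become less similar; a larger temperature makes the decay slower.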

For an arbitrary edge $e$, we then sample the connecting
variable $\mu_e$ from a Bernoulli distribution,

$$\mu_e \sim \text{Bernoulli}(q_e). \quad (28)$$
Thus, graph vertices connected by "on" edges form a connected
cluster (denoted by CC for simplicity), in which all vertices
share the same label in the partition. Vertices in a CC usually
have similar appearance and thus most likely belong to the same
object. Fig. 8 illustrates a CC generated from different
partition states. Note that edges between different objects
(differently colored nodes) are turned off deterministically.
Compared with other graph partition algorithms (e.g., Graph
Cuts [10]), which turn off edges by analytically finding the
maximum flow over the edges, the sampling method enables us to
explore more possible solutions of the graph partition.
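
The cluster generation step can be sketched as follows. This is an illustrative sketch under stated assumptions rather than our implementation: the data layout (an edge list, a dictionary `q` of turn-on probabilities keyed by edge, a dictionary `labels` of current vertex labels) and the use of a seed vertex to pick the cluster are hypothetical choices made for the example.

```python
import random


def sample_connected_cluster(vertices, edges, q, labels, seed_vertex, rng=random):
    """Sample the connecting variables and return the CC containing seed_vertex.

    Edges whose endpoints carry different labels are turned off
    deterministically; every other edge is kept "on" with probability q[e].
    """
    on_neighbors = {v: [] for v in vertices}
    for a, b in edges:
        if labels[a] != labels[b]:
            continue  # edge between different objects: always off
        if rng.random() < q[(a, b)]:  # mu_e ~ Bernoulli(q_e)
            on_neighbors[a].append(b)
            on_neighbors[b].append(a)

    # Flood fill over the "on" edges, starting from the seed vertex.
    cluster, stack = {seed_vertex}, [seed_vertex]
    while stack:
        v = stack.pop()
        for u in on_neighbors[v]:
            if u not in cluster:
                cluster.add(u)
                stack.append(u)
    return cluster
```

When all turn-on probabilities equal one the cluster grows to the whole object containing the seed, and when they equal zero it shrinks to the seed vertex alone.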

Therefore, the ratio of proposal probabilities in Eqn. 25 can
be factorized into generating and labeling the connected
cluster:

$$\frac{Q(B \to A)}{Q(A \to B)} = \frac{q(CC_B \mid B)\, q(L(CC_B))}{q(CC_A \mid A)\, q(L(CC_A))}, \quad (29)$$

$$\frac{q(CC_B \mid B)}{q(CC_A \mid A)} = \frac{\prod_{e \in C_B} (1 - q_e)}{\prod_{e \in C_A} (1 - q_e)}, \quad (30)$$

where $CC_A$ and $CC_B$ denote the connected clusters generated
in states $A$ and $B$, respectively, $C_A$ denotes the set of
edges that are turned off in state $A$, and similarly $C_B$ is
the turned-off edge set in $B$. The labeling of the connected
cluster is discussed in the next step.
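
The cluster-generation ratio of Eqn. 30 only involves the edges that must be off for the cluster to form. The sketch below assumes the usual Swendsen-Wang Cut reading, in which the turned-off set for a state is the cut between the CC and the remaining vertices of the object that the CC belongs to in that state; this reading, the function names, and the data layout are assumptions of the example, and every $q_e$ is assumed to be strictly below one.

```python
import math


def cut_edges(cluster, edges, labels, label):
    """Edges linking the cluster to outside vertices that carry `label`."""
    cut = []
    for a, b in edges:
        if a in cluster and b not in cluster and labels[b] == label:
            cut.append((a, b))
        elif b in cluster and a not in cluster and labels[a] == label:
            cut.append((a, b))
    return cut


def log_cluster_ratio(cluster, edges, q, labels, label_a, label_b):
    """log of prod_{e in C_B}(1 - q_e) / prod_{e in C_A}(1 - q_e).

    `labels` holds the labels in state A; vertices outside the cluster keep
    their labels in state B, so the same dictionary serves for both cuts.
    """
    c_a = cut_edges(cluster, edges, labels, label_a)
    c_b = cut_edges(cluster, edges, labels, label_b)
    log_b = sum(math.log(1.0 - q[e]) for e in c_b)
    log_a = sum(math.log(1.0 - q[e]) for e in c_a)
    return log_b - log_a
```

The returned value can be added to the log-posterior difference and to the log ratio of the labeling probabilities to obtain the log of the quantity inside the min in Eqn. 25.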

Step 2. We explore a new solution of graph partitioning by
labeling the generated CC. In practice, a few (e.g., 2 ~ 5) CCs
are generated and we select one of them at random.

Assume that the current partition state is
$\Pi = \{\pi_0, \pi_1, \ldots, \pi_K\}$, where $\pi_0$ denotes
the background regions and $\pi_i$, $i \in [1, K]$, a segmented
object. Note that the CC may include vertices from multiple
targets. We can then assign the CC a label from 0 to $K$ to
update the partition state through three types of reversible
jumps.

• Split-and-merge: The CC is extracted from one object and
merged into another. The jump between states (a) and (b) in
Fig. 8 is an example. This jump is self-reversible.

• Split: The selected CC is assigned a new label, that is, a
new object is created. In Fig. 8, the transition from state
(a) or (b) to state (c) is a "birth" jump.

• Merge: A whole object is selected as the CC and merged into
another object, as from state (c) to state (a) or (b) in
Fig. 8. The split and merge jumps are mutually reversible.


These jumps can be written in the same form as

$$\{L(v) = i,\ v \in CC,\ i \in [1, K]\} \rightleftharpoons \{L(v) = i',\ v \in CC,\ i' \in [1, K]\}, \quad (31)$$

where $L(v)$ indicates the label of vertex $v$.
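
The three jumps differ only in which label the CC receives and in whether the CC covers a whole object. A minimal sketch of the relabeling in Eqn. 31 is given below; the function name, the dictionary representation of the partition, and the extra "none" and "rename" outcomes for degenerate proposals are assumptions introduced for the example.

```python
def relabel_cluster(labels, cluster, new_label):
    """Assign new_label to every vertex of the CC and report the jump type.

    labels maps each vertex to its current label; all vertices of the CC
    are assumed to share one label, as they are linked by "on" edges.
    """
    old_label = labels[next(iter(cluster))]
    members = {v for v, lab in labels.items() if lab == old_label}
    label_exists = new_label in set(labels.values())

    if new_label == old_label:
        kind = "none"                      # nothing changes
    elif cluster == members:
        kind = "merge" if label_exists else "rename"
    else:
        kind = "split-and-merge" if label_exists else "split"

    new_labels = dict(labels)
    for v in cluster:
        new_labels[v] = new_label
    return new_labels, kind
```

For example, with vertices a and b in object 1 and vertex c in object 2, moving {a} to label 2 is a split-and-merge, moving {a} to an unused label is a split, and moving {a, b} to label 2 is a merge.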


B. Sampling for Temporal Graph Matching 

Sampling for graph matching in the temporal graph is similar
to sampling in the spatial graph. Note that temporal sampling
may change the state of the spatial graph, since each segmented
object in the spatial graph is a node in the temporal graph, as
shown in Fig. (d).


Algorithm 1: Sketch of trajectory analysis

Input: A period of observed frames $[0, \tau]$, of which the
frames $[\tau - 3, \tau]$ are newly input.

Output: The trajectory analysis solution $W_{[0,\tau]}$.

1. Construct graphs on the new frames.
   (1) Calculate an initial foreground map by background
       subtraction.
   (2) Extract the composite features with the SURF and MSER
       detectors.
   (3) Construct the initial spatial graph on each frame, with
       each composite feature being a vertex.
   (4) Construct the initial temporal graph.

2. Perform the sampling algorithm on the new frames
   $[\tau - 3, \tau]$.
   (1) For each frame $t \in [\tau - 3, \tau]$, loop for 80
       sampling iterations.
       (i) Perform sampling for spatial graph partitioning on
           frame $t$.
       (ii) Accept the new partition state according to the
            acceptance rate in Eqn. 25.
   (2) Sample the temporal graph matching over the frames
       $[\tau - 4, \tau]$ for 100 iterations.
       (i) Perform sampling for temporal graph matching.
       (ii) Accept the new matching state according to the
            acceptance rate in Eqn. 25.

3. Perform the sampling algorithm within the global observed
   period $[0, \tau]$. Loop for 100 rounds:
   (1) Randomly select 3 ~ 5 frames in $[0, \tau]$, and for
       each frame $t$ loop for 40 sampling iterations.
       (i) Perform sampling for spatial graph partitioning on
           frame $t$.
       (ii) Accept the new partition state according to the
            acceptance rate in Eqn. 25.
   (2) Sample the temporal graph matching over the frames
       $[0, \tau]$ for 100 iterations.
       (i) Perform sampling for temporal graph matching.
       (ii) Accept the new matching state according to the
            acceptance rate in Eqn. 25.

4. Output the final solution of trajectory analysis
   $W_{[0,\tau]}$.


Similarly, we first need to construct the temporal graph
$G^T = (V^T, E^T)$ within the observed period $[0, \tau]$ and
calculate the turn-on probabilities of the edges $e \in E^T$
between neighboring vertices. Recall that each vertex
$v \in V^T$ indicates a moving target represented by a bounding
box, as defined earlier. We can thus use simple appearance
features on the image domains of the vertices to define the
turn-on probability, similar to the definition for the spatial
graph in Eqn. 27.


In the inference for graph matching, we first randomly select
one trajectory $C_i$ at the current solution state, which
differs slightly from the inference in the spatial graph. We
then generate a sub-trajectory as the connected cluster CC by
probabilistically turning off the edge connections, as
illustrated in Fig. 9(a). Four types of reversible jumps are
then performed to update the solution state. Fig. 9 illustrates
the transitions between solution states.

• Birth: A new color is assigned to the selected CC, that is,
a new cable (trajectory) is created, as illustrated in
Fig. 9(d).

• Merge: The selected CC is merged into another cable, as
shown in Fig. 9(e). In practice, we merge the CC with
neighboring cables.

• Death: The selected CC is set as background (false alarm),
as shown in Fig. 9(c).

• Swap: This is an important operator in temporal sampling.
Given a selected CC, we swap it with another sub-cable in the
same time span. Fig. 9(b) is the state that follows the
current state in Fig. 9(a) under this operator.

Assume that $N$ trajectories are traced in the observed period
at the current state and that each vertex $v$ in a trajectory
represents a moving target. The birth, death, and merge jumps
can be written in the same form as

$$\{L(v) = i,\ v \in CC,\ i \in [0, N]\} \rightleftharpoons \{L(v) = i',\ v \in CC,\ i' \in [0, N]\}, \quad (32)$$

where $L(v)$ represents the label of $v$. The swap jump is
implemented slightly differently, since we need to select
another sub-cable $CC'$:

$$\{L(v) \leftrightarrow L(v'),\ v \in CC,\ v' \in CC'\} \rightleftharpoons \{L(v') \leftrightarrow L(v),\ v' \in CC',\ v \in CC\}, \quad (33)$$

where $L(v) \leftrightarrow L(v')$ denotes swapping the labels
of the two vertices.
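
The swap jump of Eqn. 33 can be illustrated with a short sketch. The representation is an assumption made for the example: temporal-graph vertices are hypothetical (frame, index) tuples, `labels` maps each vertex to its trajectory label, and each sub-cable is a list of vertices sharing one label over the same time span.

```python
def swap_subcables(labels, cc, cc_other):
    """Exchange the trajectory labels of two sub-cables.

    Returns a new label dictionary; the input is left untouched so that a
    rejected proposal can simply be discarded.
    """
    label_a = labels[cc[0]]
    label_b = labels[cc_other[0]]
    new_labels = dict(labels)
    for v in cc:
        new_labels[v] = label_b
    for v in cc_other:
        new_labels[v] = label_a
    return new_labels
```

Applying the same swap twice restores the original labeling, which is why this jump serves as its own reverse move.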

We summarize the proposed method in Algorithm 1 and describe
the detailed implementation in Section V.

C. Discussion of Convergence 

The joint space of $\{\Pi_{[0,\tau]}, \Phi_{[0,\tau]}\}$ over
the time span $\tau$ is so large that searching it exhaustively
is prohibitive. For example, consider the case in which there
are on average $K$ spatial graph vertices and $N$ trajectories
(moving targets). The solution space is then of the order
$O((KN)^\tau)$. In statistics, the search for the maximum of a
joint probability can be simplified by using conditional
probabilities if the prior is assumed to be weak. This inspires
us to design the algorithm to iteratively sample the
conditional probabilities $p(\Pi_{[0,\tau]} \mid \Phi_{[0,\tau]})$
and $p(\Phi_{[0,\tau]} \mid \Pi_{[0,\tau]})$, respectively, with
the two MCMC dynamics. The joint solution space is thereby
separated into two relatively simple spaces.

For either solution space, spatial graph partitioning or
temporal graph matching, the Markov chain is ergodic by virtue
of the reversible jumps, based on the Metropolis-Hastings
mechanism [30]. As the space is finite, all states can be
visited, because any node has a non-zero probability of being
chosen into the connected cluster and of being assigned any
label by the jumps. The Markov chain can therefore move from
any state to any other state with non-zero probability in a
finite number of steps.

In our method, we have to limit the number of sampling steps
for efficiency, as described in Algorithm 1. Global convergence
is then no longer guaranteed and the algorithm might end in a
local minimum. Nevertheless, we find the experimental results
satisfactory for the following reasons. First, the integration
of informative prior models, e.g., $p(\Pi_{[0,\tau]})$,
effectively accelerates the inference by quickly rejecting
false positive proposals. Second, cluster sampling is much more
efficient than traditional sampling methods. The process of
generating the connected cluster is the key to this efficiency
improvement, since discriminative appearance and motion
features are collected to generate effective proposals.
Moreover, cluster sampling enlarges the space that the
stochastic process can visit in one step and reduces the chance
of getting stuck in local minima. An empirical study of
inference convergence is presented in Section VI.

V. Implementation 

In this section, we apply our method to a video surveillance
system that also involves a background modeling module [38],
and carry out experiments with comparisons to state-of-the-art
approaches.

We start by introducing the parameter settings in our
experiments. We set the observed time span to $\tau = 15$
frames and move the observed window forward with a step size of
$\eta = 4$ frames. The other parameters of our approach are
introduced as follows.

For the composite feature definition (in Section II), the
histogram of local orientations $h(\cdot)$ consists of 72
quantized bins, each covering a small range of orientation
angles, i.e., 5 degrees. The weighting parameter $\lambda_g$ for
measuring the similarity of composite features is empirically
set to $\lambda_g = 0.25$.

The prior models (introduced in Section III) are trained in an
initial stage for each specific surveillance scene. The
partition prior $p(\Pi_{[0,\tau]})$, i.e., the location-size
prediction for tracked targets, is obtained by estimating the
extrinsic camera parameters with an interactive calibration
toolkit [25], in which we label a few parallel lines and
tracked targets to calculate the vanishing points. Note that we
assume the camera is fixed, with only one degree of freedom,
namely its height $h_c$. For the matching prior on trajectories
$p(\Phi_{[0,\tau]})$, we set the tuning parameter for robustness
to $\epsilon = 0.135$. The geometric distance between two
trajectories is normalized to $[0, 1]$. It is worth mentioning
that these prior models can be disabled by setting them to
uniform, although they are very effective in applications.

Given a period of observed frames $[0, \tau]$, we extract
composite features on the newly arriving frames, i.e., 4 frames
for each sliding window, on which we construct the spatial
graphs and a temporal graph. Note that the initial temporal
graph also consists of composite features, since no moving
target has yet been segmented in the new frames. The sampling
procedure then proceeds in two stages: sampling in the new
frames $[\tau - 3, \tau]$ and sampling in the whole observed
period $[0, \tau]$.

(I) In the first stage, spatial graph partitioning is performed
with the number of sampling iterations at each frame bounded
at 80; vertices (composite features) are grouped into potential
moving targets according to their consistent appearance and
motion. We then sample the temporal graph matching over the
frames $[\tau - 4, \tau]$; the $(\tau - 4)$-th frame must be
taken into account, since we need to extract correspondences
between the previous frames and the new frames. We set the
number of iterations of the temporal matching sampling to 100.

(II) In the second stage, spatial graph partitioning and
temporal graph matching are performed iteratively in a loop of
100 rounds, and each round includes two sampling steps. (1) A
small number (i.e., 3 ~ 5) of frames in $[0, \tau]$ are
randomly selected for graph partition sampling, and the number
of sampling iterations at each frame is bounded at 40. (2) We
then perform matching sampling over the observed period
$[0, \tau]$ for 100 iterations.
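
The two-stage schedule described above can be summarized by the following skeleton. It is a sketch of the iteration counts only: `partition_step` and `matching_step` are hypothetical callbacks, each standing for one proposal followed by the acceptance test of Eqn. 25, and the function name and argument names are assumptions made for the example.

```python
import random


def run_window(tau, partition_step, matching_step, step=4, rng=random):
    """Run the two sampling stages for one position of the observed window.

    partition_step(t) performs one spatial sampling iteration on frame t;
    matching_step(first, last) performs one temporal sampling iteration
    over the frames [first, last].
    """
    # Stage I: the newly arrived frames [tau - 3, tau].
    for t in range(tau - step + 1, tau + 1):
        for _ in range(80):
            partition_step(t)
    # The (tau - 4)-th frame links the new frames to the previous ones.
    for _ in range(100):
        matching_step(tau - step, tau)

    # Stage II: the whole observed period [0, tau], 100 rounds.
    for _ in range(100):
        n_frames = rng.randint(3, 5)
        for t in rng.sample(range(0, tau + 1), n_frames):
            for _ in range(40):
                partition_step(t)
        for _ in range(100):
            matching_step(0, tau)
```

Reducing any of the loop bounds trades accuracy for speed, which is the efficiency knob referred to in the experiments.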

VI. Experiments 

We use three public video databases, TRECVID08 [37], PETS, and
LHI [43], to evaluate our method and compare it with other
state-of-the-art approaches. These databases are very
challenging for the multi-target tracking task, including
scenarios with severe occlusions, scale changes, or complex
background structure. A number of video clips from these
databases are selected for testing, i.e., 10 videos from LHI,
8 from PETS, and 8 from TRECVID. We manually annotate the
bounding boxes of targets in the videos as the ground truth.
In our method, the types (semantic labels) of the tracked
targets are provided and serve as prior information. The videos
selected from TRECVID and PETS are all indoor scenes and the
moving targets are all pedestrians; the videos in LHI are
captured from outdoor traffic surveillance, and we thus track
moving vehicles as the targets. Table I summarizes the number
of frames as well as the number of targets in the testing
videos.

All the testing videos have a frame rate of 15 fps and a frame
size of 352 x 288 pixels. The experiments are carried out on a
workstation with a 3.0 GHz Core Duo CPU and 8 GB of memory. The
computational cost of the steps of our system (as described in
Algorithm 1) is summarized as follows. On average, constructing
graphs on new frames costs 80 ~ 100 ms; sampling on new frames,
including spatial graph partitioning and temporal graph
matching, costs 300 ~ 450 ms; and sampling within the global
observed period costs around 600 ~ 800 ms. Recall that the
algorithm processes 4 newly arriving frames at a time, i.e.,
the observed window moves with a step size of 4 frames. Thus,
our system is capable of processing 3 ~ 5 frames per second on
average. In practice, the efficiency can be improved by
reducing the numbers of sampling iterations.


Database     No. of Frames     No. of Targets
TRECVID      8972              389
PETS         7409              194
LHI          15213             436

TABLE I
Testing sequences from public video databases


A few representative results of trajectory analysis are
presented in Fig. 10. Most of the video clips are very
challenging due to crowded objects, scale changes, severe
occlusions, and low resolution.

In order to quantitatively evaluate the performance, we
introduce several object-level benchmark metrics, including
Recall, Precision, FA/Frm, and SwitchIDs, as shown in
Table II, which are also adopted in [20], [23]. Other
performance measures have been proposed in the literature,
such as Multiple Object Tracking Precision and Accuracy
(MOTA) [14], [46]. These measures are less transparent, as
they integrate multiple factors into one scalar-valued measure,
although they give an overall picture of the performance. We
wrote a program that automatically matches the results with the
ground truth based on these metrics.
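
A minimal sketch of such an evaluation program is given below for the three frame-based metrics. It is an illustration under stated assumptions and not the program used in our experiments: the greedy per-frame matching and the intersection-over-union threshold of 0.5 are assumptions, boxes are hypothetical (x1, y1, x2, y2) tuples, and SwitchIDs is omitted because it requires track identities.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def evaluate(frames, iou_threshold=0.5):
    """frames: list of (ground_truth_boxes, output_boxes), one pair per frame.

    Returns (recall, precision, false_alarms_per_frame).
    """
    matched = total_gt = total_out = 0
    for gt_boxes, out_boxes in frames:
        used = set()
        for g in gt_boxes:
            best, best_j = 0.0, None
            for j, o in enumerate(out_boxes):
                score = iou(g, o)
                if j not in used and score > best:
                    best, best_j = score, j
            if best_j is not None and best >= iou_threshold:
                used.add(best_j)          # each output matches one target
                matched += 1
        total_gt += len(gt_boxes)
        total_out += len(out_boxes)
    recall = matched / total_gt if total_gt else 0.0
    precision = matched / total_out if total_out else 0.0
    fa_per_frame = (total_out - matched) / len(frames) if frames else 0.0
    return recall, precision, fa_per_frame
```

For a single frame with one ground-truth target, one exact detection, and one spurious detection, the sketch yields a Recall of 1.0, a Precision of 0.5, and one false alarm per frame.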

We compare our method with recently proposed approaches for
similar scenarios [46], [14], [19].


Methods               Recall    Precision    FA/Frm    SwitchIDs
Zhao et al. [46]      76.2%     72.7%        1.31      12
Huang et al. [14]     69.1%     63.1%        1.82      13
Leibe et al. [19]     78.9%     69.4%        2.01      9
The proposed          83.3%     79.4%        0.72      7
without priors        81.3%     78.2%        1.10      8

TABLE III
Results on videos from the TRECVID database


Methods               Recall    Precision    FA/Frm    SwitchIDs
Zhao et al. [46]      82.4%     79.7%        0.92      18
Huang et al. [14]     71.1%     68.5%        1.98      14
Leibe et al. [19]     79.1%     73.1%        1.38      16
The proposed          87.7%     82.9%        0.82      8
without priors        86.2%     79.8%        1.21      9

TABLE IV
Results on videos from the PETS database


Methods               Recall    Precision    FA/Frm    SwitchIDs
Huang et al. [14]     73.2%     72.6%        1.27      14
Leibe et al. [19]     79.7%     73.4%        1.51      10
The proposed          91.3%     86.1%        0.84      7
without priors        90.8%     82.1%        1.07      9

TABLE V
Results on videos from the LHI database


Table III, Table IV, and Table V compare the quantitative
results of our method with those of the methods of Zhao et
al. [46], Huang et al. [14], and Leibe et al. [19]. The method
of Zhao et al. [46] tracks pedestrians with a model-based
approach that interprets the image observations by multiple
partially occluded human hypotheses, and thus we only apply
this method to the TRECVID and PETS databases for human
tracking. The results show that our method achieves the best
performance: higher Recall, higher Precision, lower FA/Frm, and
fewer SwitchIDs. To illustrate the benefits of using
informative priors in trajectory analysis, we also report the
system performance with the prior components disabled. The
analysis of these experiments is presented as follows.

1) Using deferred frames for global inference, i.e., an
observed window, is very helpful, since it provides more
information for handling occlusions and mutual
interactions.

2) The prior components, e.g., the location-size prediction,
give very important cues for segmenting targets that are
stuck together; they effectively reduce the false alarms.

3) The matching priors on trajectories, namely the birth,
death, lifespan, and shape of the cable, are strong
constraints, particularly for tracking vehicles in the
traffic surveillance scene, since the motions of vehicles
are usually regular in a given scene.

4) In the PETS dataset, many pedestrians have very similar
appearances (e.g., black coats) or motions (e.g., walking
together); despite this, the iterative sampling algorithm
is shown to effectively reduce the number of SwitchIDs.

Fig. 10. Several representative tracking results on the public datasets.

Metric       Definition
Recall       Frame-based correctly matched targets / total ground-truth targets
Precision    Frame-based correctly matched targets / total output targets
FA/Frm       Frame-based number of false alarms per frame
SwitchIDs    The number of times that the track IDs of two targets switch

TABLE II
Evaluation Metrics

Fig. 11. The curves of Average Tracing Rate (ATR) for our trajectory analysis result and comparisons. (a) Tests on the PETS dataset. (b) Tests on the LHI dataset. The horizontal axis of ATR represents the coverage
rate of the traced trajectory compared to the ground truth; the vertical axis represents the proportion of trajectory length. In this evaluation, we compare our
method with two other MCMC-based approaches: MCMC Data Association (MCMCDA) [45] and Trajectory Parsing. The curves on the left are tested
on the PETS dataset, and the curves on the right are tested on the LHI dataset.


In addition, we introduce a novel benchmark metric to evaluate
the trajectory-level performance, namely the Average Tracing
Rate (ATR), which is defined as the ratio of the traced
trajectory length to the ground-truth trajectory length. The
horizontal axis of ATR represents the coverage rate of the
traced trajectory compared to the ground truth of the testing
videos; the vertical axis represents the proportion of
trajectory length. The ATR of a trajectory analysis result is
plotted as a point curve over discretized evaluation levels.
This metric offers an intuitive and straightforward way to
visualize the consistency of the tracked trajectories. In
Fig. 11 we present the ATR curves of our method on the PETS and
LHI datasets. In this evaluation, we compare with two other
MCMC-based stochastic approaches for trajectory analysis, MCMC
Data Association (MCMCDA) by Yu et al. [45] and Trajectory
Parsing by Liu et al.

To further analyze the algorithm convergence, we present an
empirical study that visualizes the output energy during
inference. Here the output energy,
$-\log p(W_{[0,\tau]} \mid I_{[0,\tau]})$, is the negative
logarithm of the posterior probability within an observed
period $[0, \tau]$. In Fig. 12(a), for an arbitrary period, we
compare with Gibbs sampling for the trajectory analysis. For
this comparison, we replace the cluster sampling method at each
step of the algorithm by the traditional Gibbs sampler. We
observe that cluster sampling converges significantly faster.
Moreover, we investigate the output energies with respect to
the two important parameters of our system, the observed period
length $\tau$ and the forward step size $\eta$. This experiment
is also carried out within a period of observed frames. We
first fix $\eta = 4$ and increase $\tau$ over five scales:
$\tau = 15, 20, 25, 30, 35$. That is, we increase the length of
the period and handle more video frames in inference. Then we
increase $\eta = 4, 6, 8, 10, 12$ with fixed $\tau = 15$, to
gradually reduce the overlap with the previous inference. The
empirical results are reported in Fig. 12(b), where the
horizontal axis represents the scale of either parameter.


VII. Conclusion 

The objective of this paper is to track multiple video targets
and recover their trajectories in the presence of occlusion,
interruption, and background clutter. Compared with previous
methods in the literature, the main contributions of this paper
are as follows. First, we propose a novel unified framework of
trajectory analysis that jointly solves spatial graph
partitioning and temporal graph matching. Second, a robust
composite feature bundling the MSER and SURF features is
presented for the affinity model of moving targets, to cope
with scale transition and non-rigid motion. Third, we design a
stochastic sampling algorithm that iteratively solves the
spatial graph partitioning and temporal graph matching. This
algorithm is designed under the Metropolis-Hastings method and
does not need a good initialization.

We have applied our method in an intelligent video surveillance
system and found its performance satisfactory. In the
experiments, our method is tested on several challenging videos
from the public video databases of visual surveillance,
including TRECVID, PETS, and LHI, and it outperforms the
state-of-the-art methods.

In future work, it is important to integrate object recognition
into the trajectory analysis, which will lead to a more general
solution for video surveillance applications. In addition, we
plan to study a parallel implementation of the MCMC-based
inference to further improve the computational efficiency.
Fig. 12. Empirical study of algorithm convergence. (a) visualizes the energies (the vertical axis), i.e.,
$-\log p(W_{[0,\tau]} \mid I_{[0,\tau]})$, at every 10 iterations (the horizontal axis) within an observed period $[0, \tau]$. The red curve and green curve in (a) represent the energies of our algorithm
and of the traditional Gibbs sampler, respectively. (b) shows the converged output energies for different values of the observed period length $\tau$ and the forward step size
$\eta$. The blue curve represents the converged energies with fixed $\eta = 4$ and increasing $\tau$: $\tau = 15, 20, 25, 30, 35$. The red curve represents those with fixed $\tau = 15$
and increasing $\eta$: $\eta = 4, 6, 8, 10, 12$.